Wrangling the Data
To begin my analysis, I first load the ward-level shapefiles for the Kathmandu valley into my environment. No public GIS data on political boundaries exists; in its absence, civilians have taken up the work of digitizing them. All spatial files in this analysis are thanks to the hard work of Kiran Joshi, whose blog can be found here.
What is colloquially known as "Kathmandu" is in reality three cities. Legend says that three sons divided their father's kingdom into three parts to rule equally, creating the three cities of Kathmandu, Lalitpur, and Bhaktapur. But in spirit and in function, these cities work as one. Hence, my analysis starts by loading and combining the three shapefiles.
%%capture
import geopandas as gpd
import pandas as pd
ktm = gpd.read_file('/Users/avani/Documents/atlas/data/sfs/Kathmandu/Kathmandu.shp')
bkt = gpd.read_file('/Users/avani/Documents/atlas/data/sfs/Bhaktapur/Bhaktapur.shp')
ltp = gpd.read_file('/Users/avani/Documents/atlas/data/sfs/Lalitpur/Lalitpur.shp')
cities = pd.concat([ktm, bkt, ltp])
#cleaning up the Center column, as it contains neighbourhood name information
cities['Center'] = cities['Center'].str.replace(' Office$', '', regex=True)
hhwealth = pd.read_excel('/Users/avani/Documents/atlas/data/nepalcensus/HHwealth.xlsx')
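One caveat worth flagging here: `pd.concat` keeps the CRS of the first frame and will not warn if the layers disagree. A minimal defensive sketch, assuming each layer exposes a `.crs` attribute as GeoDataFrames do (the helper name is my own, not part of the original analysis):

```python
# Hypothetical helper: verify all layers share a CRS before concatenating,
# since pd.concat silently keeps only the first layer's CRS.
def assert_same_crs(layers):
    crss = {str(getattr(layer, "crs", None)) for layer in layers}
    if len(crss) > 1:
        raise ValueError(f"layers have mismatched CRSs: {sorted(crss)}")

# usage would be: assert_same_crs([ktm, bkt, ltp]) just before pd.concat
```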
After a bit of manipulation and cleaning, my final dataframe gives me ward-level information on the total number of households and the number of households in each wealth quintile, ranging from 1 to 5, with 1 being the most economically vulnerable households.
#creating new names for the income quintile columns
colnames = [
    'prov', 'dist', 'gapa', 'name', 'ward', 'tot_hh',
    'quart1_count', 'quart2_count', 'quart3_count', 'quart4_count', 'quart5_count',
    'quart1_%', 'quart2_%', 'quart3_%', 'quart4_%', 'quart5_%'
]
#dropping the header rows above the data
hhwealth = hhwealth.iloc[3:].reset_index(drop=True)
hhwealth.columns = colnames
#the dataset contains stray text rows between wards; coercing to numeric turns them into NaN so they drop out of later filters
numeric_columns = [col for col in hhwealth.columns if col != 'name']
for col in numeric_columns:
hhwealth[col] = pd.to_numeric(hhwealth[col], errors='coerce')
#only selecting the observations for the three municipal districts
valhhwealth = hhwealth[hhwealth['dist'].isin([28, 29, 30])]
valhhwealth.head()
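Before joining, it is worth sanity-checking that the five quintile shares in each ward add up to roughly 100%. A small sketch on toy numbers (the column names mirror the census frame, but the values here are invented):

```python
import pandas as pd

# toy rows imitating the census quintile-share columns (values are invented)
toy = pd.DataFrame({
    'quart1_%': [22.0, 10.5],
    'quart2_%': [21.0, 15.5],
    'quart3_%': [20.0, 24.0],
    'quart4_%': [19.5, 26.0],
    'quart5_%': [17.5, 24.0],
})
share_cols = ['quart1_%', 'quart2_%', 'quart3_%', 'quart4_%', 'quart5_%']
row_sums = toy[share_cols].sum(axis=1)
# rows whose shares stray more than a point from 100 deserve a closer look
suspect = toy[(row_sums - 100).abs() > 1]
```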
As there are no spatial geometries present in the municipal data, we want to join the census information to our ward-level map by the name of the municipality and its corresponding ward. But, having been phonetically converted to English, the municipalities are not spelled the same way across the datasets. Additionally, the data contains unnecessary (and unstandardized) descriptors which will inhibit a join.
To address these issues, I remove all unneeded suffixes and match my datasets using a fuzzy criterion: two municipality names are considered the same if they are at least 80% similar.
import warnings
warnings.filterwarnings('ignore')
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
#helper function to clean unnecessary descriptors that don't add to the analysis
def clean_name(name):
name = name.strip()
suffs = [
' Municipality',
' Metropolitian City',
' Metropolitan City',
' Mahanagarpalika',
' Nagarpalika',
' Nagarpalika VDC',
' Mun',
' Gabisa Karyalaya',
' Gabisa Bhawan',
' Gaunpalika',
' Sub Metro'
]
clean = name
for suffix in suffs:
if clean.endswith(suffix):
clean = clean[:-len(suffix)].strip()
return clean
# Cleaning the column containing municipality information
cities['Municipality'] = cities['Mun_Name'].apply(clean_name)
valhhwealth['Municipality'] = valhhwealth['name'].apply(clean_name)
# fuzzy matching according to cleaned names
def find_match(name, choices, min_score=80):
best_match = process.extractOne(name, choices)
if best_match and best_match[1] >= min_score:
return best_match[0]
return None
# creating a mapping column of fuzzy matched names
name_mapping = {}
for name in cities['Municipality'].unique():
match = find_match(name, valhhwealth['Municipality'].unique())
if match:
name_mapping[name] = match
#adding those to the dataset
cities['new_name'] = cities['Municipality'].map(name_mapping)
# joining my census data to the spatial files
cities_inc = pd.merge(cities,
valhhwealth,
left_on=['new_name', 'Ward_No'],
right_on=['Municipality', 'ward'],
how='left')
#finally, converting my data to WGS84
cities_inc = cities_inc.to_crs('EPSG:4326')
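To see why the suffix cleaning matters before the fuzzy match, compare scores for raw and cleaned names. fuzzywuzzy's basic ratio is built on the standard library's `difflib.SequenceMatcher`, so a rough stdlib approximation is enough to illustrate (scores from fuzzywuzzy itself may differ slightly, and the name pair below is only an example):

```python
from difflib import SequenceMatcher

def rough_ratio(a, b):
    # stdlib approximation of fuzz.ratio: a 0-100 similarity score
    return round(SequenceMatcher(None, a, b).ratio() * 100)

# a raw census-style spelling vs a cleaned shapefile spelling falls below the 80 cutoff...
raw = rough_ratio('Kathmandu', 'Kathmandu Mahanagarpalika')  # ~53
# ...while the suffix-stripped names match exactly
clean = rough_ratio('Kathmandu', 'Kathmandu')                # 100
```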
Setting Limits for Kathmandu
Our analysis will not be very useful if we include sparsely populated hill wards in our dataset. Our study area needs tighter city limits, so I use a road-density-based selection criterion. Using road network data from OpenStreetMap, I calculate the density of streets by area for each ward, then filter out all wards that fall in the bottom 30th percentile for road density.
import osmnx as ox
from shapely.geometry import Polygon
warnings.filterwarnings('ignore')
#getting my valley shapefile
valley = cities_inc.dissolve().make_valid()
street_network = ox.graph_from_polygon(valley.geometry.iloc[0], network_type="drive")
#getting street edges
edges = ox.graph_to_gdfs(street_network, edges=True, nodes=False).to_crs(cities_inc.crs)
#counting number of streets per ward
cities_inc['streetcount'] = cities_inc.geometry.apply(
lambda x: len(edges[edges.intersects(x)])
)
#creating density metric
cities_inc['streetdensity'] = cities_inc['streetcount']/cities_inc['Area_SQKM']
#selecting the 30th percentile threshold
threshold = cities_inc['streetdensity'].quantile(0.30)
#subsetting wards at or above the threshold (copying to avoid SettingWithCopyWarning later)
main_city = cities_inc[cities_inc['streetdensity'] >= threshold].copy()
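The percentile cut itself works the same on any numeric column; here is the filtering logic on toy densities (numbers invented for illustration):

```python
import pandas as pd

# toy street densities for ten hypothetical wards
density = pd.Series([2, 5, 8, 12, 15, 20, 25, 30, 40, 60], dtype=float)
threshold = density.quantile(0.30)  # value below which the sparsest 30% fall
kept = density[density >= threshold]
# the three sparsest wards (2, 5, 8) are dropped; seven remain
```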
Visualizing Wealth Patterns
The resulting dataset allows us to visualize the distribution of households by economic quintile across Kathmandu. Toggle the layer selection on the top right to see how the percentage of households belonging to each quintile varies across the city.
import folium
m = main_city.explore(
column='quart5_%',
legend=True,
cmap='viridis',
name='Highest Income',
tooltip=['name', 'ward', 'quart5_%', 'quart4_%', 'quart3_%', 'quart2_%','quart1_%']
)
main_city.explore(
column='quart1_%',
cmap='inferno',
name='Lowest Income',
m=m,
tooltip=['name', 'ward', 'quart5_%', 'quart4_%', 'quart3_%', 'quart2_%','quart1_%']
)
folium.LayerControl().add_to(m)
m
Estimating Income Distribution
Though the census data doesn't provide us with median household income, it is possible to estimate it using other publicly available government data. The National Statistics Office publishes the median and mean income values for each quintile on its website.
I used this information to compute a weighted estimate of median income for each ward, using the ward's quintile proportions as weights.
#getting the median income value for each economic quintile
med_inc = {
'quart1': 242797,
'quart2': 299341,
'quart3': 377000,
'quart4': 465301,
'quart5': 617882
}
#calculating the weighted estimate based on ward-level household counts
main_city['Est Income'] = (
main_city['quart1_count'] * med_inc['quart1'] +
main_city['quart2_count'] * med_inc['quart2'] +
main_city['quart3_count'] * med_inc['quart3'] +
main_city['quart4_count'] * med_inc['quart4'] +
main_city['quart5_count'] * med_inc['quart5']
) / main_city[['quart1_count', 'quart2_count', 'quart3_count',
'quart4_count', 'quart5_count']].sum(axis=1)
#rounding and expressing income in lakhs of NPR (1 lakh = 100,000)
main_city['Est Income'] = (main_city['Est Income'] / 100000).round(1)
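To make the weighting concrete, here is the same calculation by hand for a single hypothetical ward (the household counts are invented; the quintile medians are the NSO figures used above):

```python
# NSO median income per quintile, in NPR (the same figures used above)
med_inc = {'quart1': 242797, 'quart2': 299341, 'quart3': 377000,
           'quart4': 465301, 'quart5': 617882}

# invented household counts for an illustrative ward
counts = {'quart1': 100, 'quart2': 80, 'quart3': 60, 'quart4': 40, 'quart5': 20}

# weighted mean of the quintile medians, then expressed in lakhs (NPR 100,000)
weighted = sum(counts[q] * med_inc[q] for q in med_inc) / sum(counts.values())
est_lakhs = round(weighted / 100000, 1)  # -> 3.4
```

With more households concentrated in the lower quintiles, this ward's estimate lands below the midpoint of the quintile medians, which is exactly the behaviour the map relies on.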
Let's visualize this data to explore further.
m = main_city.explore(
column='Est Income',
legend=True,
cmap='viridis',
name='Estimated Income',
tooltip=['name', 'ward', 'Est Income', 'quart5_%', 'quart4_%', 'quart3_%', 'quart2_%','quart1_%']
)
folium.LayerControl().add_to(m)
m